The following will make sure you have what you need to run the rest of the code.
library(tidyverse)
model_variables = read.csv('data/model_variables_anonymized.csv')
Use caret to employ machine learning
Start with some pre-processing of the data
library(caret) # need to install?
set.seed(1234) # so that the indices will be the same when re-run
trainIndices = createDataPartition(model_variables$libuser, p=.8, list=F)
X_train = model_variables %>%
slice(trainIndices)
X_test = model_variables %>%
slice(-trainIndices)
Example with XGBoost
library(xgboost) # need to install?
xgb_opts = expand.grid(
eta = c(.3, .4),
max_depth = c(9, 12),
colsample_bytree = c(.6, .8),
subsample = c(.5, .75, 1),
nrounds = 100, # 1000 would be more reasonable, but notably time consuming
min_child_weight = 1,
gamma = 0
)
cv_opts = trainControl(method='cv', number=10)
Run in parallel
# for parallel processing
library(doParallel) # need to install?
cl = makeCluster(detectCores() - 1)
registerDoParallel(cl)
results_xgb = train(
libuser ~ .,
data = X_train,
method = 'xgbTree',
preProcess = c('center', 'scale'),
trControl = cv_opts,
tuneGrid = xgb_opts
)
stopCluster(cl)
results_xgb
preds_gb = predict(results_xgb, X_test)
confusionMatrix(preds_gb, X_test$libuser, positive='yes')
With machine learning, we finally get to a point where Python is on par with and typically surpasses R.
Most techniques that would fall under the heading of machine learning are first developed in Python.
For at least some techniques, Python will typically run faster, possibly notably so, but this depends on many factors.
# note how when using something other than R, you have to specify the engine path
import pandas as pd
import numpy as np
import statsmodels
model_variables = pd.read_csv('data/model_variables_anonymized.csv')
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1000) # number of trees
rf_opts = {'max_features': np.arange(2,7)} # tuning parameter
rf_estimator = GridSearchCV(rf, cv=10, param_grid=rf_opts, n_jobs=4) # 10-fold cv
results_rf = rf_estimator.fit(X_train, y_train) # requires matrices
Inspect the best result over the tuning parameters
results_rf.best_score_
results_rf.best_params_
Test model on new data
rf_predict = results_rf.predict(X_test)
print(metrics.classification_report(y_test, rf_predict))